Abstract

Exact parsing with finite state automata is deemed in- 
appropriate because of the unbounded non-locality lan- 
guages overwhelmingly exhibit. We propose a way to 
structure the parsing task in order to make it amenable 
to local classification methods. This allows us to build a 
Dynamic Bayesian Network which uncovers the syntac- 
tic dependency structure of English sentences. Experi- 
ments with the Wall Street Journal demonstrate that the 
model successfully learns from labeled data. 



Introduction 



Bayesian graphical models have become an important
explanatory strategy in cognitive science (Knill and
Richards, 1996; Kording and Wolpert, 2004; Stocker and
Simoncelli, 2005). Recent work strongly supports their
biological plausibility in general and that of dynamic
Bayesian models in particular (Rao, 2005). Dynamic models
are geared towards prediction and classification of
sequences. As such, they are naturally suitable for language
modeling and have already been applied to tasks like speech
recognition (Livescu et al., 2003) and part-of-speech tagging
(Peshkin et al., 2003). However, grammar learning and parsing
with such models generally appear out of reach because of
their Markovian character.

Markov models restrict possible dependencies to a 
bounded, local context. At one extreme, the context is con- 
fined to the symbol occupying the current position in the 
sequence (order-0 or unigram models). In more relaxed ver- 
sions, context may include a fixed number of positions be- 
fore the current symbol (k-order), typically no more than 
three (trigram models). The restricted space of possible
dependencies allows transition probabilities to be inferred
from the data and stored in a look-up table with relatively
little technical sophistication.

Not surprisingly however, the restricted space of rep- 
resentable dependencies is also the main disadvantage of 
Markov models in syntax-related tasks like parsing. Syntac- 
tic dependencies in natural language are unboundedly non- 
local, in the sense that no fixed amount of context is guar- 
anteed to contain the members of a given constituent. For 
example, consider the sentences in examples (1)-(3). In the
first sentence, the subject king and verb bought are adjacent
to one another. Thus, the dependency between them would be
captured by a bigram (order-1) model. However, the same
model would be unable to represent the dependency in the
second example, because the subject and verb are separated
by two words. To capture this dependency, we need a
3rd-order Markov model. Similarly, the 3rd-order model
would prove inadequate for the third example, where the
subject and verb are separated by four words.

(1) The king bought a camel.

(2) The king of Prussia bought a camel. 

(3) The king of some strange country bought a camel. 

Our solution to this problem relies on representing
sentences with non-local dependencies like (2)-(3) as derived
from their local dependency variants, akin to (1). This
intuition is based on the formal notion that a string with
non-local dependency is obtained from a dependency tree 
via a recursive linearization procedure. The string obtained 
at each step of the linearization procedure contains new 
local dependencies, which push apart local dependencies 
from previous levels. This way of conceptualizing the lin- 
earization of syntactic structure allows us to use a Dynamic 
Bayesian Network despite its Markov properties. We con- 
struct a DBN parser which decides only on local attach- 
ments. We then call the parser recursively to uncover the un- 
derlying dependency tree. Our results show that the model 
captures grammatical knowledge for all levels of the deriva- 
tion. The biological plausibility and remarkable compact- 
ness of learned representation may suggest that parsing in 
the brain is accomplished in a similar manner. 

Dependency grammar 

Tree-based linguistic representations of natural language
syntax treat non-local dependencies as local in the
two-dimensional tree structure, of which the string is a
one-dimensional projection. The dependency grammar
representation of (2) captures the dependency between the
subject, the object and the verb, and the dependency between
the determiners and their respective nouns (Figure 1).



Fig. 1: Dependency structure of example (2). [Dependency tree
rooted in bought, with nodes king, camel and Prussia.]



More formally, a dependency grammar consists of a lex- 
icon of terminal symbols (words), and an inventory of de- 
pendency relations specifying inter-lexical requirements. A 
string is generated by a dependency grammar if and only if: 

• Every word but one (ROOT) is dependent on another 
word. 

• No word is dependent on itself either directly or indirectly. 

• No word is dependent on more than one word. 

• Dependencies do not cross. 

In a dependency tree, each word is the mother of its
dependents, otherwise known as their HEAD. To linearize the
dependency tree in Figure 1 into a string, we introduce the
dependents recursively next to their heads:
Step I: bought 
Step II: king bought camel 
Step III: The king of bought a camel 
Step IV: The king of Prussia bought a camel. 
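
The linearization steps above can be sketched in a few lines
of Python. This is only an illustration: the dictionary
encoding of Figure 1, the name `linearize`, and the
assumption that every word in the sentence is unique are ours
and are not part of the original system.

```python
# Each head maps to (left dependents, right dependents). This encoding of
# Figure 1 is an assumption made for illustration.
TREE = {
    "bought": (["king"], ["camel"]),
    "king": (["The"], ["of"]),
    "camel": (["a"], []),
    "of": ([], ["Prussia"]),
}


def linearize(tree, root):
    """Yield the string after each step of placing dependents next to heads."""
    string = [root]
    frontier = [root]  # words introduced in the previous step
    yield list(string)
    while True:
        new_string, new_frontier = [], []
        for word in string:
            if word in frontier:
                left, right = tree.get(word, ([], []))
                new_string.extend(left + [word] + right)
                new_frontier.extend(left + right)
            else:
                new_string.append(word)
        if not new_frontier:
            return  # no word introduced any dependents: linearization is done
        string, frontier = new_string, new_frontier
        yield list(string)


steps = [" ".join(step) for step in linearize(TREE, "bought")]
for number, step in enumerate(steps, start=1):
    print(number, step)
```

Each yielded string corresponds to one of Steps I to IV
above; words are identified by their spelling, so a sentence
with repeated words would need index-based tokens instead.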

Recursive parsing as local classification 

Parsing in the dependency grammar framework is the task 
of uncovering the dependency tree given the sentence. Sup- 
pose that instead of searching for a complete parse given 
a complete sentence, we restricted our task to compressing 
the string up the linearization path. Note that linearization 
is essentially dependency parsing in reverse. In other words, 
we can uncover the dependency structure by labeling the lo- 
cal head-dependent relationships at the bottom linearization 
level (i.e. the sentence) and erasing from the string the words 
whose heads are already found. We recursively process the 
output until the root level. Thus, if as a first step in
parsing (2), we pick the head of Prussia to be the
preposition of, we can compress the string to a form
virtually equivalent to linearization Step III. Picking king
as the head of the preposition leads us to compress the
string further, to the equivalent of Step II. To compress
the string, we must simply identify which words in the
string occupy a position adjacent to their heads.

The attractive feature of this representation is that the
parsing decisions taken at each step are local. Hence,
parsing can be converted into a local classification task.
The task is to choose the best sequence of labels denoting
local dependency relationships (links). At each position, we
choose between setting the link to LEFT, RIGHT, or NONE,
where LEFT/RIGHT means the word is dependent on its
left/right neighbor. NONE means the search for this word's
head should be postponed until later stages of compression.
The output of the classifier is a labeled string, which can
be compressed by removing linked dependents. It is fed
through recursively, until the string is compressed to the
ROOT.
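
The label-and-compress loop described above might look as
follows. This is a minimal sketch under stated assumptions:
the names `parse` and `make_replay` are hypothetical, the
classifier is a stand-in that replays hand-written labels
rather than a trained DBN, and the loop stops when a single
word remains instead of modeling an explicit ROOT symbol.

```python
LEFT, RIGHT, NONE = "LEFT", "RIGHT", "NONE"


def parse(words, classify):
    """Label local links, remove linked dependents, and repeat.

    `classify` maps the current list of words to one label per word and is
    assumed to return a well-formed sequence (no LEFT at the first position,
    no RIGHT at the last). Returns a dict: dependent index -> head index.
    """
    tokens = list(enumerate(words))
    heads = {}
    while len(tokens) > 1:
        labels = classify([word for _, word in tokens])
        if all(label == NONE for label in labels):
            raise ValueError("no attachment made; the string cannot shrink")
        survivors = []
        for pos, ((idx, _), label) in enumerate(zip(tokens, labels)):
            if label == LEFT:
                heads[idx] = tokens[pos - 1][0]
            elif label == RIGHT:
                heads[idx] = tokens[pos + 1][0]
            else:
                survivors.append(tokens[pos])  # head search is postponed
        tokens = survivors
    return heads


def make_replay(rounds):
    """Stand-in classifier that replays hand-written label sequences."""
    pending = iter(rounds)
    return lambda words: next(pending)


sentence = "The king of Prussia bought a camel".split()
rounds = [
    [RIGHT, NONE, NONE, LEFT, NONE, RIGHT, NONE],  # The, Prussia, a linked
    [NONE, LEFT, NONE, LEFT],                      # of, camel linked
    [RIGHT, NONE],                                 # king linked to bought
]
heads = parse(sentence, make_replay(rounds))
print({sentence[d]: sentence[h] for d, h in heads.items()})
```

After the first round the string is "king of bought camel",
after the second it is "king bought", and after the third
only the head word "bought" remains.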

The Dynamic Bayesian Network classifier 

The first step towards building the classifier is coming up 
with a feature representation. We will briefly motivate the 
choice of feature set with linguistic arguments. It is easy to 
determine that the linking pattern of a word depends on its 
part of speech (PoS) and the part of speech of its neighbor. 
For example, English determiners only link to the right, and 
adverbs link almost exclusively to verbs. However, the parts 
of speech alone are not sufficient to determine linking be- 
havior. In some cases, the identity of the adjacent word is re- 
quired - bought accepts links from nouns to the right, while 
slept does not. 

Another decisive factor is how many dependents the current
word has acquired so far. Since the current word becomes
unavailable as a future linking target for other words once
it is linked, it is important to ascertain that its valency
has already been satisfied. Valency refers to the minimal
number of dependents a word actively seeks to license. In
English and other SVO languages, a word has particular
requirements with respect to the number of left and right
dependents. Thus, in our feature representation, valency is
indirectly captured by two variables, which reflect the
number of dependents that have already been linked to the
current word from either side: LEFT and RIGHT COMPOSITE
(COMP). The COMP variables affect not only the linking
behavior of the current token, but that of its neighbor as
well. If the word has already received many dependents from
one side, the probability of accepting yet another one
becomes smaller, since its valency is already satisfied.

Finally, the current label depends on the labels of its
neighbors, because if the previous label is RIGHT, then the
current label cannot be LEFT, and if the next label is LEFT,
the current label cannot be RIGHT. Thus, our full feature
representation consists of the word and its PoS tag, the
words and PoS tags of its neighbors, the two valencies of
the current word, the right valency of its left neighbor and
the left valency of its right neighbor, as well as the
neighboring links.

The Word and Next Word feature vocabularies contain the
2500 most frequent words in the data. An additional value
was allocated for all remaining out-of-vocabulary words.
The PoS and Next PoS vocabularies contain 36 of the original
45 tags of the Penn Treebank tagset, after all punctuation
PoS tags were removed. The left and right COMP features had
three values: NONE, ONE and MANY.
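
As a rough illustration of the feature vocabularies just
described, the sketch below builds a frequency-limited word
vocabulary with a single out-of-vocabulary value and buckets
dependent counts into the three COMP values. The names
`build_vocab`, `encode_word`, `encode_comp` and the `<OOV>`
symbol are assumptions made for this example; the original
encoding scripts are not shown in the text.

```python
from collections import Counter

OOV = "<OOV>"  # assumed name for the single out-of-vocabulary value


def build_vocab(sentences, size=2500):
    """Keep the `size` most frequent words; all others will map to OOV."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, _ in counts.most_common(size)}


def encode_word(word, vocab):
    """Return the word itself if it is in the vocabulary, else OOV."""
    return word if word in vocab else OOV


def encode_comp(n_dependents):
    """Bucket a count of already-linked dependents into NONE/ONE/MANY."""
    if n_dependents == 0:
        return "NONE"
    if n_dependents == 1:
        return "ONE"
    return "MANY"


toy_corpus = [["a", "b", "a"], ["a", "c", "b"]]
vocab = build_vocab(toy_corpus, size=2)
print([encode_word(w, vocab) for w in ["a", "b", "c"]])
print([encode_comp(n) for n in (0, 1, 3)])
```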

This feature representation is used as the basis of the
Dynamic Bayesian Network (DBN). After we briefly introduce
the essential aspects of DBNs, we will expand on the
structure of the network for parsing. For more information
on the general class of models, we refer the reader to a
recent dissertation (Murphy, 2002), which provides an
excellent survey.



General notes on DBNs 

A DBN is a Bayesian network unwrapped in "time" (i.e. over
a sequence), such that it can represent dependencies between
variables at adjacent positions. More formally, a DBN
consists of two models $B^0$ and $B^+$, where $B^0$ defines
the initial distribution over the variables at position 0,
by specifying:

• a set of variables $X_1, \ldots, X_n$;

• a directed acyclic graph over the variables;

• for each variable $X_i$, a table specifying the conditional
probability of $X_i$ given its parents in the graph,
$\Pr(X_i \mid \mathrm{Par}\{X_i\})$.

The joint probability distribution over the initial state is: 

$$\Pr(X_1, \ldots, X_n) = \prod_{i=1}^{n} \Pr(X_i \mid \mathrm{Par}\{X_i\}).$$

The transition model $B^+$ specifies the conditional
probability distribution (CPD) over the state at time $t$
given the state at time $t-1$. $B^+$ consists of:

• a directed acyclic graph over the variables
$X_1, \ldots, X_n$ and their predecessors
$X_1^-, \ldots, X_n^-$, which are roots of this graph;

• conditional probability tables
$\Pr(X_i \mid \mathrm{Par}\{X_i\})$ for all $X_i$ (but not
for $X_i^-$).

The transition probability distribution is: 

$$\Pr(X_1, \ldots, X_n \mid X_1^-, \ldots, X_n^-) = \prod_{i=1}^{n} \Pr(X_i \mid \mathrm{Par}\{X_i\}).$$

Together, $B^0$ and $B^+$ define a probability distribution
over the realizations of a system through time, which
justifies calling these BNs "dynamic". In our setting, the
word's index in a sentence corresponds to time, while
realizations of a system correspond to correctly tagged
English sentences. Probabilistic reasoning about such a
system constitutes inference.

Standard inference algorithms for DBNs are similar to 
those for HMMs. Note that, while the kind of DBN we con- 
sider could be converted into an equivalent HMM, that would 
render the inference intractable due to a huge resulting state 
space. In a DBN, some of the variables will typically be 
observed, while others will be hidden. The typical infer- 
ence task is to determine the probability distribution over 
the states of a hidden variable over time, given time series 
data of the observed variables. This is usually accomplished 
using the forward-backward algorithm. Alternatively, we 
might obtain the most likely sequence of hidden variables 
using the Viterbi algorithm. Either kind of inference
yields the resulting LINK tags.
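
To make the Viterbi step concrete, here is a small
self-contained sketch for the simplest case of one hidden
variable per slice (an HMM-style chain). It is not the
toolkit's implementation and not the full parsing network;
the function name `viterbi`, the two-label state set and all
probabilities below are invented for illustration.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden state sequence for `observations`."""
    # best[t][s]: probability of the best path that ends in state s at time t
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((r, best[t - 1][r] * trans_p[r][s]) for r in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score * emit_p[s][observations[t]]
            back[t][s] = prev
    # Trace the best path backwards from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


# Toy tables (assumed values): hidden LINK labels, observed PoS tags.
states = ["NONE", "RIGHT"]
start_p = {"NONE": 0.5, "RIGHT": 0.5}
trans_p = {
    "NONE": {"NONE": 0.5, "RIGHT": 0.5},
    "RIGHT": {"NONE": 0.75, "RIGHT": 0.25},
}
emit_p = {
    "NONE": {"det": 0.25, "noun": 0.75},
    "RIGHT": {"det": 0.75, "noun": 0.25},
}
links = viterbi(["det", "noun"], states, start_p, trans_p, emit_p)
print(links)
```

With these toy numbers the determiner is linked to the right
and the noun is left unattached, which mirrors the behavior
of English determiners described earlier.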

Learning the parameters of a DBN from data is generally
accomplished using the EM algorithm. However, in our model,
learning is equivalent to collecting statistics over
co-occurrences of feature values and link labels. This is
implemented in GAWK scripts and takes minutes on a large
corpus. While exact inference algorithms are intractable in
large DBNs and are replaced by a variety of approximate
methods, the number of hidden state variables in our model
is small enough to allow exact algorithms to work. For
inference we use the standard algorithms, as implemented in
a recently released toolkit (Bilmes and Zweig, 2002).
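
Because every variable is observed in the training data,
estimating a conditional probability table reduces to
counting. The sketch below shows the idea in Python rather
than GAWK; `estimate_cpt` and the toy observations are
hypothetical, and, as in the experiments reported here, no
smoothing is applied.

```python
from collections import Counter, defaultdict


def estimate_cpt(pairs):
    """Estimate Pr(feature value | link label) by relative frequency.

    `pairs` is an iterable of (link_label, feature_value) observations.
    """
    counts = defaultdict(Counter)
    for link, value in pairs:
        counts[link][value] += 1
    return {
        link: {value: n / sum(ctr.values()) for value, n in ctr.items()}
        for link, ctr in counts.items()
    }


# Toy observations of (LINK, PoS) pairs, invented for illustration.
data = [("RIGHT", "det"), ("RIGHT", "det"), ("RIGHT", "adj"), ("NONE", "verb")]
cpt = estimate_cpt(data)
print(cpt["RIGHT"])
```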



Structure of the DBN parser 

Each slice of our DBN parser is a representation of the
joint probability distribution of WORD, POS, LEFT/RIGHT
COMP, and the hidden variable LINK (Figure 2). In our model,
the link determines the values of all other variables, and
they are independent of one another given the link. Of
course, this is not truly the case, but among those
variables LINK is the only unobserved one, hence modeling
all other dependencies is inconsequential. In addition to
the intra-slice dependencies, we model dependencies between
the current, previous and next position. The LINK variable
influences all aforementioned variables in neighboring
slices. Finally, we introduce a CONTROL variable which
deterministically ensures that at least one link in the
sequence will be set to something other than NONE. This
forces the parser to truly compress the string at each
recursive parsing step.
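
The hard constraints on a LINK sequence can be summarized as
a simple validity check. This is only a sketch of the
constraints stated in the text, written as a post-hoc filter
rather than as variables inside the network; the function
name `is_valid` is hypothetical, and the two boundary
conditions (no LEFT at the first position, no RIGHT at the
last) are our own assumption.

```python
def is_valid(labels):
    """Check the hard constraints on a sequence of LINK labels."""
    # CONTROL: at least one link must differ from NONE, so the string shrinks.
    if all(label == "NONE" for label in labels):
        return False
    # Assumed boundary conditions: there is no neighbor to attach to.
    if labels[0] == "LEFT" or labels[-1] == "RIGHT":
        return False
    # Two adjacent words cannot choose each other as heads.
    for current, following in zip(labels, labels[1:]):
        if current == "RIGHT" and following == "LEFT":
            return False
    return True


print(is_valid(["RIGHT", "NONE", "LEFT", "NONE"]))
print(is_valid(["NONE", "NONE"]))
print(is_valid(["RIGHT", "LEFT"]))
```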



Fig. 2: The parsing DBN. [Two adjacent slices, K and K+1,
each containing the variables Word, PoS, LeftComp and
RightComp.]



Experiments and results 

For the results presented here we used the WSJ10 corpus
(Klein and Manning, 2004). It is a subset of the WSJ Penn
Treebank (Marcus et al., 1993), consisting of all sentences
shorter than eleven words with punctuation removed. The
dependency annotation was obtained through automatic
conversion of the original treebank annotation. The
relatively short sentences make this corpus a good
approximation to casual speech and limit the effects of
misattachments due to the conversion.



Encoding 

The corpus was encoded in our feature representation as
follows. For each sentence, a number of feature files were
produced containing the feature representation of the
sentence at each linearization level. The encoding of an
actual sentence-structure pair from our corpus (Figure 3) is
illustrated in Figures 4 to 7. (The dot in our figures
stands for an abstract ROOT symbol.)



her immediate predecessor suffered a nervous breakdown

Fig. 3: Dependency structure.

At the lowest level, no word has any discovered dependents,
hence the COMP values are zero everywhere. All links of
words whose heads are not adjacent are labeled NONE (0).

Link:      _     RIGHT      _            _         _    RIGHT    _          _
Word:      her   immediate  predecessor  suffered  a    nervous  breakdown  .
PoS:       pron  adj        noun         verb      det  adj      noun       root
LeftComp:  0     0          0            0         0    0        0
RightComp: 0     0          0            0         0    0        0

Fig. 4: First layer representation.

At the next level, words whose labels were LEFT or RIGHT
are removed from the structure and the COMP counters for
their heads are incremented.

Link:      RIGHT  _            _         RIGHT  _          _
Word:      her    predecessor  suffered  a      breakdown  .
PoS:       pron   noun         verb      det    noun       root
LeftComp:  0      1            0         0      1
RightComp: 0      0            0         0      0

Fig. 5: Second layer representation.

The same procedure produces the subsequent levels (Figures
6 and 7).

Link:      RIGHT        _         LEFT       _
Word:      predecessor  suffered  breakdown  .
PoS:       noun         verb      noun       root
LeftComp:  2            0         2
RightComp: 0            0         0

Fig. 6: Third layer representation.
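
The level-by-level encoding can be reproduced with a short
script. The sketch below is our own reading of the procedure
and makes several assumptions: the function name
`encode_levels` is hypothetical, the head indices for the
example sentence are inferred from the figures rather than
taken from the treebank, and a word is linked only once none
of its own dependents remain in the string (which is what
the figures suggest, since predecessor is adjacent to its
head at the first level but is not yet linked).

```python
def encode_levels(words, heads):
    """Return (words, links, left_comp, right_comp) for every level.

    `heads[i]` is the index of word i's head, or None for the ROOT symbol.
    """
    alive = list(range(len(words)))
    left = [0] * len(words)   # dependents already absorbed from the left
    right = [0] * len(words)  # dependents already absorbed from the right
    levels = []
    while len(alive) > 1:
        # Words that still have a dependent in the string must wait.
        busy = {heads[i] for i in alive if heads[i] is not None}
        links = []
        for pos, i in enumerate(alive):
            if heads[i] is None or i in busy:
                links.append("NONE")
            elif pos > 0 and alive[pos - 1] == heads[i]:
                links.append("LEFT")
            elif pos + 1 < len(alive) and alive[pos + 1] == heads[i]:
                links.append("RIGHT")
            else:
                links.append("NONE")
        levels.append(([words[i] for i in alive], links,
                       [left[i] for i in alive], [right[i] for i in alive]))
        if all(link == "NONE" for link in links):
            raise ValueError("tree cannot be compressed by adjacent links")
        for i, link in zip(alive, links):
            if link == "RIGHT":
                left[heads[i]] += 1   # the dependent sat to the left of its head
            elif link == "LEFT":
                right[heads[i]] += 1  # the dependent sat to the right of its head
        alive = [i for i, link in zip(alive, links) if link == "NONE"]
    return levels


words = "her immediate predecessor suffered a nervous breakdown .".split()
heads = [2, 2, 3, 7, 6, 6, 3, None]  # assumed gold heads; "." is ROOT
levels = encode_levels(words, heads)
for level_words, links, left_comp, right_comp in levels:
    print(level_words, links, left_comp, right_comp)
```

Under these assumptions the script produces four levels whose
words and link labels agree with the layer figures for this
sentence.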

Testing 

The corpus was split randomly 9:1 into a training and a
testing section. In training mode, the DBN was given all
levels with the correct labels. It was trained directly on
the annotations, with no additional smoothing. The result
achieved was 79% correct link attachment for directed
dependencies, and 82% for undirected. We compare the results
to two baselines given for this corpus by Klein and Manning
(2004) in Table 1. More detailed results for our model are
shown in Table 2.
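
For readers unfamiliar with the two measures, the following
sketch shows one common way to compute directed and
undirected attachment accuracy from head maps. The exact
scoring script used for the reported numbers is not described
in the text, so the function `attachment_scores` and the toy
trees are assumptions for illustration only.

```python
def attachment_scores(gold, predicted):
    """Return (directed, undirected) accuracy for two head maps.

    Both arguments map a dependent's index to its head's index. A gold link
    counts as undirected-correct if the predicted tree connects the same
    two words in either direction.
    """
    directed = sum(predicted.get(d) == h for d, h in gold.items())
    undirected = sum(
        predicted.get(d) == h or predicted.get(h) == d
        for d, h in gold.items()
    )
    total = len(gold)
    return directed / total, undirected / total


gold = {0: 1, 1: 2, 3: 2}
predicted = {0: 1, 2: 1, 3: 1}  # one link reversed, one attached wrongly
directed_acc, undirected_acc = attachment_scores(gold, predicted)
print(directed_acc, undirected_acc)
```

In the toy example one of three links is fully correct and a
second is correct up to direction, so the undirected score is
higher than the directed one.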

Link:      RIGHT     _
Word:      suffered  .
PoS:       verb      root
LeftComp:  1
RightComp: 1

Fig. 7: Top layer representation.

Tab. 1: DBN results against baseline.

                     Accuracy (%)
Model                Dir    Undir
DBN                  79     82
Random               30     46
Adjacent heuristic   34     57

Tab. 2: Detailed results for the DBN.

Measure               Accuracy (%)
Root dependency       83
Non-root dependency   78
Out-of-Vocabulary     75
Sentence              36

The results unequivocally surpass the random baseline and
the best available heuristic, which amounts to linking every
word to its right neighbor. This suggests our model has
learned at least some of the non-trivial dependencies which
govern the choice of link structure. The minimal difference
between the in-vocabulary and out-of-vocabulary scores
implies that the network can recover the syntactic
properties of an unknown word in context. The fact that the
root accuracy is higher than the non-root accuracy allows us
to conclude that the network correctly learns to postpone
decisions about the root word in all cases, and about its
dependents in most cases.

Discussion 

Our results show that combining a DBN model with recursive
application is a reasonable parsing strategy. This opens the
door to the hypothesis that Bayesian inference is a possible
mechanism for parsing in the brain, despite the Markovian
properties of the corresponding dynamic models. The high
ROOT accuracy suggests that the model has captured some
fundamental principles defining the local dependency
structure at all levels of the derivation. We take this
result as evidence that graphical models with Markov
properties are capable of handling unbounded non-local
dependencies through recursive calls on their own output.
The implications of this finding transcend Bayesian
graphical models and speak to the general issue of how
relevant other biologically plausible Markov models can be
to language processing and learning. For example, Elman
networks have been criticized for their a priori limitation
in handling unbounded dependencies (Frank et al., 2005). It
is possible that models of this type may be adapted to
discover locality in the hierarchical structure through
recursive application.

One exciting implication of this hypothesis is the
domain-generality of Bayesian inference and learning
mechanisms. Previous work has proposed that these mechanisms
are involved in visual perception (Knill and Richards, 1996;
Kersten and Yuille, 2003), motor control (Kording and
Wolpert, 2004), and attention modulation (Yu and Dayan,
2005). Kersten and Yuille (2003) propose Bayesian graphical
models of object detection which rely on estimating hidden
variables such as relative depth and 3-D structure from the
observables they influence, such as shadow displacement and
2-D projection. Kording and Wolpert (2004) suggest that
subjects in a sensory-motor experiment internally represent
both the statistical distribution of the task and their
sensory uncertainty, combining them in a manner consistent
with a performance-optimizing Bayesian process. In our work,
the hidden links are estimated from the observable word and
PoS, along with a prior label distribution.

The parallelism in the proposed cognitive strategies for
all these different modalities may shed light on the issue
of whether, and to what degree, the language faculty is
modular. The modularity hypothesis states that the cognitive
mechanisms underlying linguistic competence are specific to
language. If
Bayesian inference proves to be a plausible uniting principle 
behind visual, motor and linguistic abilities, this hypothesis 
is seriously undermined. At the same time, it is important to 
note that the generality of the mechanism does not necessar- 
ily negate the modularity of language completely. The fea- 
ture representation which our model used already encodes 
language-specific knowledge. Further research is needed to 
determine whether the feature representation and the struc- 
ture of the network can be induced through structure learn- 
ing algorithms. 

Our approach is particularly appealing in light of recent
work suggesting that Bayesian-type inference is biologically
plausible. Rao (2005) shows that recurrent networks of noisy
integrate-and-fire neurons can perform approximate Bayesian
inference for dynamic and hierarchical graphical models. In
his account, the membrane potential dynamics of neurons
correspond to approximate belief propagation in the log
domain, and the spiking probability of a neuron approximates
the posterior probability of the preferred state encoded by
the neuron, given past inputs. This seems to suggest that
our parsing model can be implemented in a neural circuit.
Furthermore, since the same DBN is used to uncover local
dependencies throughout all levels of the derivation, such
an implementation would address Humboldt's characterization
of language as a system that makes "infinite use of finite
means" at the neurophysiological level. The same neural
apparatus could be used to recursively uncover the
dependency structure of a sentence level by level.

Another implication of our work is that the nature of the
processing architecture may constrain the kind of grammar
human languages permit. If indeed parsing is accomplished
through recursive processing of the output of previous
stages, some types of long-distance dependencies would be
impossible to detect. In particular, if the material
intervening between a head-dependent pair (H, D) is not a
constituent whose own head depends on either H or D, our
model would not be able to uncover the dependency because H
and D will not be adjacent at any point in the derivation.
In other words, this parser is incapable of handling
strictly context-sensitive languages. To the extent that
such dependencies exist, however, they are fairly limited
(Shieber, 1985). Such cases will need to be resolved through
some reordering in pre-processing, possibly based on case
marking.

Future work 

One deficiency of our model is that decisions at lower
levels cannot be reversed in the interest of more optimal
choices at higher levels. There are, however, important
reasons why this might be necessary. For example, a
prepositional phrase subcategorized for by the verb may be
mistakenly attached to a preceding noun phrase, leaving the
verb with a missing dependent (4).

(4) The king put *[the camel in the trunk].

In the future, we hope to address this problem through a 
form of beam search - retaining the k-best parses at each 
level and choosing among them based on what happens at 
the next level. 

Another important issue that we need to address is the
total loss of information about the dependents that have
been linked to a word at previous levels. Some well-known
cases pose a problem for this aspect of our model. For
example, the sentences in (5) and (6) are structurally
distinct solely because the complement of the prepositional
phrase in the second sentence is an instrument appropriate
for seeing.

(5) The king saw [the camel with two humps].

(6) The king saw *[the camel with a telescope]. 

In our current model, once the complement is linked to the
preposition, the two sentences will become identical, and
one of them will be assigned the wrong structure. This
concern can be addressed by introducing new variables, which
keep track not only of the number of linked dependents but
also of their semantic category (e.g. instrument, animate,
etc.).

A natural way to extend our model in a different direction
is to combine it with the Bayesian PoS tagger developed by
Peshkin et al. (2003). Allowing the model to infer
PoS tags and structure simultaneously will be a significantly 
better approximation to the parsing task humans are faced 
with. Last but not least, we would like to implement semisu- 
pervised learning. One way to do this would involve start- 
ing off with a small labeled set of sentences at all parsing 
depths, followed by presenting unparsed whole sentences. 
The parses suggested by the model would in their turn be 
used for learning in a bootstrap fashion. 



Conclusion 

In our closing remarks, we would like to emphasize several 
aspects of our parsing model which make it interesting from 
the perspective of cognitive science and brain-inspired artifi- 
cial intelligence. First, it belongs to a class of models which 
have been used recently to capture cognitive mechanisms 
in non-linguistic domains. Second, it naturally utilizes the
overwhelming "disguised locality" of natural language
syntax: in other words, it benefits from the fact that
string-non-local dependencies are tree-local. Third, it is
biologically plausible, since inference in models of this
class has been shown to be implementable in a neural
circuit. And finally, it takes seriously the question of how
the finite amount of brain hardware is capable of encoding
structures of unbounded depth. While there is much
room for improvement, we believe these qualities make it 
an important step on the difficult road toward understanding 
how the mind emerges from the brain. 